342 research outputs found
Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element
The running maximum-minimum (max-min) filter computes the maxima and minima
over running windows of size w. This filter has numerous applications in signal
processing and time series analysis. We present an easy-to-implement online
algorithm requiring no more than 3 comparisons per element, in the worst case.
Comparatively, no algorithm is known to compute the running maximum (or
minimum) filter in 1.5 comparisons per element, in the worst case. Our
algorithm has reduced latency and memory usage.Comment: to appear in Nordic Journal of Computin
A Better Alternative to Piecewise Linear Time Series Segmentation
Time series are difficult to monitor, summarize and predict. Segmentation
organizes time series into few intervals having uniform characteristics
(flatness, linearity, modality, monotonicity and so on). For scalability, we
require fast linear time algorithms. The popular piecewise linear model can
determine where the data goes up or down and at what rate. Unfortunately, when
the data does not follow a linear model, the computation of the local slope
creates overfitting. We propose an adaptive time series model where the
polynomial degree of each interval vary (constant, linear and so on). Given a
number of regressors, the cost of each interval is its polynomial degree:
constant intervals cost 1 regressor, linear intervals cost 2 regressors, and so
on. Our goal is to minimize the Euclidean (l_2) error for a given model
complexity. Experimentally, we investigate the model where intervals can be
either constant or linear. Over synthetic random walks, historical stock market
prices, and electrocardiograms, the adaptive model provides a more accurate
segmentation than the piecewise linear model without increasing the
cross-validation error or the running time, while providing a richer vocabulary
to applications. Implementation issues, such as numerical stability and
real-world performance, are discussed.Comment: to appear in SIAM Data Mining 200
Scale And Translation Invariant Collaborative Filtering Systems
Collaborative filtering systems are prediction algorithms over sparse data sets of user preferences. We modify a wide range of state-of-the-art collaborative filtering systems to make them scale and translation invariant and generally improve their accuracy without increasing their computational cost. Using the EachMovie and the Jester data sets, we show that learning-free constant time scale and translation invariant schemes outperforms other learning-free constant time schemes by at least 3% and perform as well as expensive memory-based schemes (within 4%). Over the Jester data set, we show that a scale and translation invariant Eigentaste algorithm outperforms Eigentaste 2.0 by 20%. These results suggest that scale and translation invariance is a desirable property
Extracting, Transforming and Archiving Scientific Data
It is becoming common to archive research datasets that are not only large
but also numerous. In addition, their corresponding metadata and the software
required to analyse or display them need to be archived. Yet the manual
curation of research data can be difficult and expensive, particularly in very
large digital repositories, hence the importance of models and tools for
automating digital curation tasks. The automation of these tasks faces three
major challenges: (1) research data and data sources are highly heterogeneous,
(2) future research needs are difficult to anticipate, (3) data is hard to
index. To address these problems, we propose the Extract, Transform and Archive
(ETA) model for managing and mechanizing the curation of research data.
Specifically, we propose a scalable strategy for addressing the research-data
problem, ranging from the extraction of legacy data to its long-term storage.
We review some existing solutions and propose novel avenues of research.Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
Faster Base64 Encoding and Decoding Using AVX2 Instructions
Web developers use base64 formats to include images, fonts, sounds and other
resources directly inside HTML, JavaScript, JSON and XML files. We estimate
that billions of base64 messages are decoded every day. We are motivated to
improve the efficiency of base64 encoding and decoding. Compared to
state-of-the-art implementations, we multiply the speeds of both the encoding
(~10x) and the decoding (~7x). We achieve these good results by using the
single-instruction-multiple-data (SIMD) instructions available on recent Intel
processors (AVX2). Our accelerated software abides by the specification and
reports errors when encountering characters outside of the base64 set. It is
available online as free software under a liberal license.Comment: software at https://github.com/lemire/fastbase6
Strongly universal string hashing is fast
We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that
these families---though they require a large buffer of random numbers---are
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
low-powered processors. Our tests include hash functions designed for
processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.Comment: Software is available at
http://code.google.com/p/variablelengthstringhashing/ and
https://github.com/lemire/StronglyUniversalStringHashin
Attribute Value Reordering For Efficient Hybrid OLAP
The normalization of a data cube is the ordering of the attribute values. For
large multidimensional arrays where dense and sparse chunks are stored
differently, proper normalization can lead to improved storage efficiency. We
show that it is NP-hard to compute an optimal normalization even for 1x3
chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are
nearly statistically independent, we show that dimension-wise attribute
frequency sorting is an optimal normalization and takes time O(d n log(n)) for
data cubes of size n^d. When dimensions are not independent, we propose and
evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is
already 19%-30% more efficient than ROLAP, but normalization can improve it
further by 9%-13% for a total gain of 29%-44% over ROLAP
Monotonicity Analysis over Chains and Curves
Chains are vector-valued signals sampling a curve. They are important to
motion signal processing and to many scientific applications including location
sensors. We propose a novel measure of smoothness for chains curves by
generalizing the scalar-valued concept of monotonicity. Monotonicity can be
defined by the connectedness of the inverse image of balls. This definition is
coordinate-invariant and can be computed efficiently over chains. Monotone
curves can be discontinuous, but continuous monotone curves are differentiable
a.e. Over chains, a simple sphere-preserving filter shown to never decrease the
degree of monotonicity. It outperforms moving average filters over a synthetic
data set. Applications include Time Series Segmentation, chain reconstruction
from unordered data points, Optical Character Recognition, and Pattern
Matching.Comment: to appear in Proceedings of Curves and Surfaces 200
Tag-Cloud Drawing: Algorithms for Cloud Visualization
Tag clouds provide an aggregate of tag-usage statistics. They are typically
sent as in-line HTML to browsers. However, display mechanisms suited for
ordinary text are not ideal for tags, because font sizes may vary widely on a
line. As well, the typical layout does not account for relationships that may
be known between tags. This paper presents models and algorithms to improve the
display of tag clouds that consist of in-line HTML, as well as algorithms that
use nested tables to achieve a more general 2-dimensional layout in which tag
relationships are considered. The first algorithms leverage prior work in
typesetting and rectangle packing, whereas the second group of algorithms
leverage prior work in Electronic Design Automation. Experiments show our
algorithms can be efficiently implemented and perform well.Comment: To appear in proceedings of Tagging and Metadata for Social
Information Organization (WWW 2007
- âŠ